1. TEXT CLASSIFICATION: IMDB data

Load data

Preprocessing

GloVe

First, we load the pretrained 100-dimensional GloVe vectors (https://nlp.stanford.edu/projects/glove/) into a dictionary, so that the vector for a given word can be looked up as dictionary[word].
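
A minimal sketch of this loading step, assuming the vectors come in the plain-text GloVe format (one word per line, followed by its space-separated components). The function name load_glove and the tiny in-memory sample are illustrative; with the real download you would pass an open file handle instead (the 100-dimensional file is assumed here to be glove.6B.100d.txt).

```python
import io
import numpy as np

def load_glove(lines):
    """Parse GloVe text-format lines into a {word: vector} dictionary."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = parts[0]
        embeddings[word] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Tiny in-memory stand-in for the real file (3 dimensions here, not 100).
sample = io.StringIO("the 0.1 0.2 0.3\nmovie 0.4 0.5 0.6\n")
glove = load_glove(sample)
print(glove["movie"])
```

With the downloaded file, the call would look like: with open("glove.6B.100d.txt", encoding="utf-8") as f: glove = load_glove(f).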

Cleaning the text for GloVe:

We define a clean function that removes special characters and converts the text to lowercase.
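
One possible version of such a function is sketched below. The name clean_text and the exact rule (keep only letters, digits, and single spaces) are assumptions; the notebook's own function may treat characters such as apostrophes or hyphens differently.

```python
import re

def clean_text(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the removed characters.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("A Spy-Thriller!! (Set in 1960s Berlin...)")
print(cleaned)
```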

Converting the words to vectors and taking the average for each movie plot
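
The averaging step can be sketched as follows. The helper name average_vector, the whitespace tokenization, and the choice to skip out-of-vocabulary words (returning a zero vector when no word is known) are assumptions made for illustration; the toy embeddings are 2-dimensional rather than 100-dimensional.

```python
import numpy as np

def average_vector(text, embeddings, dim):
    """Mean of the vectors of all in-vocabulary tokens; zeros if none are found."""
    vectors = [embeddings[token] for token in text.split() if token in embeddings]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Toy embeddings standing in for the GloVe dictionary.
toy = {"good": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}
doc_vec = average_vector("good film unknownword", toy, dim=2)
print(doc_vec)
```

Applying this function to every cleaned plot and stacking the results gives one row per movie, which is the matrix that PCA operates on.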

PCA of movie plots

Now that we have calculated the averaged GloVe vectors for the movie plots, we can calculate their principal components.

We use the PCA class imported from sklearn, but we also demonstrate how it can be calculated without the library.

Making use of sklearn
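
A short sketch of the sklearn route. The random matrix X is only a placeholder for the (number of plots, 100) matrix of averaged vectors, and the choice of 4 components is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))  # placeholder for the averaged plot vectors

pca = PCA(n_components=4)
X_pca = pca.fit_transform(X)  # centers the data and projects it onto the components

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```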

Without sklearn
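
One way to do the same computation by hand is an eigendecomposition of the covariance matrix, sketched below with NumPy only. The function name pca_numpy and the random placeholder data are assumptions; an SVD of the centered data would be an equally valid route.

```python
import numpy as np

def pca_numpy(X, n_components):
    """PCA via eigendecomposition of the sample covariance matrix."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    scores = X_centered @ components
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained_ratio

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # placeholder data
scores, components, ratio = pca_numpy(X, n_components=3)
print(ratio)
```

The signs of individual components are arbitrary, so the projected values may be mirrored relative to a library implementation while the explained variance ratios agree.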

Classify genres based on the first n principal components of GloVe vectors
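
The sketch below shows one way to set this up. The classifier (logistic regression), the value n_components = 10, the genre names, and the synthetic two-class data are all assumptions, since the outline does not say which classifier or which n is used. Wrapping PCA and the classifier in a pipeline ensures the components are fitted on the training split only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Synthetic stand-in: two "genres" with shifted means in a 100-dimensional space.
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 100)),
               rng.normal(1.0, 1.0, size=(150, 100))])
y = np.array(["drama"] * 150 + ["comedy"] * 150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

n_components = 10  # number of leading principal components fed to the classifier
model = make_pipeline(PCA(n_components=n_components),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```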

FastText Classification using 3-gram embeddings
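
A hedged sketch of the training setup, assuming the fasttext Python package and its supervised mode, which reads one example per line in the form "__label__<label> <text>". Reading "3-gram" as word n-grams up to length 3 (wordNgrams=3) is an assumption; character 3-grams would instead be set with minn=3 and maxn=3. The helper names, the epoch and lr values, and the example text are illustrative.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    safe_label = label.replace(" ", "_")    # labels must not contain spaces
    single_line = " ".join(text.split())    # remove newlines and repeated spaces
    return f"__label__{safe_label} {single_line}"

def train_fasttext(train_path):
    """Train a supervised model; wordNgrams=3 is an assumed reading of '3-gram'."""
    import fasttext  # imported lazily so the formatting helper works without the package
    return fasttext.train_supervised(input=train_path, wordNgrams=3, epoch=25, lr=0.5)

line = to_fasttext_line("Sci Fi", "A crew wakes\nfrom cryosleep.")
print(line)
```

After writing the formatted lines to a training file, train_fasttext(path) returns a model whose predict method gives the most likely label for a new text.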

Classification on test set

PCA on FastText embeddings

First, we get the FastText embeddings for each text from the model

Now we run PCA on these embeddings and plot the explained variance ratio.

The principal components PC1 and PC2, PC2 and PC3, and PC3 and PC4 are shown in a plot

Write your own plot and see which genre the GloVe + PCA classifier and the FastText model predict for it

2. SENTIMENT ANALYSIS: The Donald

Calculate window-wise sentiment, with the window size and stride as variables that can easily be changed
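
A minimal sketch of a sliding-window computation with adjustable window_size and stride. The function name, the per-token scoring function, and the four-word toy lexicon are assumptions for illustration; any sentiment scorer with the same call signature could be substituted.

```python
def window_sentiment(tokens, score_fn, window_size=50, stride=10):
    """Mean sentiment score for each sliding window of tokens."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if not tokens:
        return []
    scores = []
    # The last window starts where a full window still fits; shorter inputs get one window.
    last_start = max(len(tokens) - window_size, 0)
    for start in range(0, last_start + 1, stride):
        window = tokens[start:start + window_size]
        scores.append(sum(score_fn(t) for t in window) / len(window))
    return scores

# Hypothetical toy lexicon: +1 for positive words, -1 for negative, 0 otherwise.
LEXICON = {"great": 1, "tremendous": 1, "sad": -1, "disaster": -1}
tokens = "great great sad disaster tremendous sad".split()
sentiment = window_sentiment(tokens, lambda t: LEXICON.get(t, 0), window_size=2, stride=2)
print(sentiment)
```

A stride smaller than the window size makes consecutive windows overlap, which gives a denser but more correlated series.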

Plot the sentiment over time and apply a smoothing filter
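
The outline does not name the filter, so the sketch below assumes a simple moving average; a Gaussian or Savitzky-Golay filter would be a drop-in alternative. The function names and the sample series are illustrative, and matplotlib is imported only inside the plotting helper.

```python
import numpy as np

def moving_average(values, width=5):
    """Smooth a series with a uniform kernel; the output has len(values) - width + 1 points."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(width) / width
    return np.convolve(values, kernel, mode="valid")

def plot_sentiment(raw, smoothed, width):
    """Plot the raw and smoothed series on a shared window-index axis."""
    import matplotlib.pyplot as plt
    # Shift the smoothed curve so each point sits at the center of its averaging span.
    x_smooth = np.arange(len(smoothed)) + (width - 1) / 2
    plt.plot(raw, alpha=0.4, label="raw")
    plt.plot(x_smooth, smoothed, label=f"moving average (width={width})")
    plt.xlabel("window index")
    plt.ylabel("sentiment")
    plt.legend()
    plt.show()

raw = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
smoothed = moving_average(raw, width=2)
print(smoothed)
```

Calling plot_sentiment(raw, smoothed, width=2) draws both curves; a larger width gives a smoother line at the cost of hiding short-lived swings.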